Welcome

Placeholder

What you will learn

Skills

Data

What you won’t learn

Simple vs Easy

How to Contribute

About the author

Jacob Kaplan (http://crimedatatool.com/) is a Ph.D. candidate in criminology at the University of Pennsylvania. His research focuses on Crime Prevention Through Environmental Design (CPTED), specifically on the effect of outdoor lighting on crime. He is the author of several R packages, such as asciiSetupReader, fastDummies, and boxoffice. His website Crime Data Tool allows easy analysis of crime-related data and he has released over a dozen crime data sets (primarily FBI UCR data) on openICPSR. He is currently on the job market.

For a list of papers he has written (including working papers), please see here.

For a list of data sets he has cleaned, aggregated, and made public, please see here.

1 Introduction to R and RStudio

Placeholder

1.1 Why learn to program?

1.1.1 Scale

1.1.2 Reproducibility

1.2 Using RStudio

1.2.1 Opening an R Script

1.2.2 Setting the working directory

1.2.3 Changing RStudio

1.2.3.1 General

1.2.3.2 Code

1.2.3.3 Appearance

1.2.3.4 Pane Layout

1.2.4 Helpful cheatsheets

1.3 Reading data into R

1.3.1 Loading data

1.4 First steps to exploring data

1.5 Finding help about functions

(PART) Clean

2 Subsetting: Making big things small

Placeholder

2.1 Select specific values

2.2 Assigning values to objects (Making “things”)

2.3 Vectors (collections of “things”)

2.4 Logical values and operations

2.4.1 Matching a single value

2.4.2 Matching multiple values

2.4.3 Does not match

2.4.4 Greater than or less than

2.4.5 Combining conditional statements - or, and

2.5 Subsetting a data.frame

2.5.1 Select specific columns

2.5.2 Select specific rows

2.5.3 Battleships

2.5.4 Subset Colorado data

3 Exploratory data analysis

Placeholder

3.1 Summary and Table

3.2 Graphing

3.3 Aggregating (summaries of groups)

4 Dates and times

Placeholder

4.1 Why do dates and times matter?

4.2 lubridate

4.3 Working with dates

4.4 Chicago crime data

4.4.1 Exercises

5 Regular Expressions

Placeholder

5.1 Finding patterns in text with grep()

5.2 Finding and replacing patterns in text with gsub()

5.3 Useful special characters

5.3.1 Multiple characters []

5.3.2 n-many of previous character {n}

5.3.3 n-many to m-many of previous character {n,m}

5.3.4 Start of string and “not” ^

5.3.5 End of string $

5.3.6 Anything .

5.3.7 One or more of previous +

5.3.8 Zero or more of previous *

5.3.9 Multiple patterns |

5.3.10 Parentheses ()

5.3.11 Optional text ?

5.4 Changing capitalization

(PART) Collect

6 Webscraping with rvest

Placeholder

6.1 Scraping one page

6.2 Cleaning the webscraped data

6.3 Fixing names

6.3.1 Exercises

7 Functions

Placeholder

7.1 A simple function

7.2 Adding parameters

7.3 Making a function to scrape movie data

8 For loops

Placeholder

8.1 Basic for loops

8.2 Scraping multiple days of movie data

9 Reading and Writing Data

Placeholder

9.1 Reading Data into R

9.1.1 R

9.1.2 Excel

9.1.3 Stata

9.1.4 SAS

9.1.5 SPSS

9.2 Writing Data

9.2.1 R

9.2.2 Excel

9.2.3 Stata

9.2.4 SAS

9.2.5 SPSS

10 Scraping data from PDFs

Placeholder

10.1 Downloading officer-involved Shooting Files

10.2 Scraping information from the page

10.2.1 Combining the data sets

10.3 Extracting data from PDFs

10.3.1 Scraping a single PDF

10.3.2 Making a function

10.3.3 Looping through every PDF

11 Scraping Tables from PDFs

Placeholder

11.1 Scraping the first table

11.2 Making a function

12 Geocoding

Placeholder

12.1 Geocoding a single address

12.2 Making a function

12.3 Geocoding officer shooting locations

(PART) Visualize

13 Graphing with ggplot2

Placeholder

13.1 What does the data look like?

13.2 Graphing data

13.3 Time-Series Plots

13.4 Color blindness

14 Hotspot maps

Placeholder

14.1 A simple map

14.2 What really are maps?

14.3 Making a hotspot map

14.3.1 Colors

14.4 Looping through each year

15 Choropleth maps

Placeholder

15.1 Spatial joins

15.2 Making choropleth maps

16 Interactive maps

While maps of data are useful, their ability to show incident-level information is quite limited. They tend to show broad trends - where crime happened in a city - rather than provide information about specific crime incidents. While broad trends are important, there is a significant drawback to a static map: you cannot get important information about an incident without going back to check the data. An interactive map bridges this gap by showing trends while allowing you to zoom into individual incidents and see information about each incident.

For this lesson we will continue to use the officer shooting data so let’s load that.

16.1 Why do interactive graphs matter?

16.1.1 Understanding your data

The most important thing to learn from this course is that understanding your data is crucial to good research. Making interactive maps is a very useful way to better understand your data as you can immediately see geographic patterns and quickly look at characteristics of those incidents to understand them.

In this lesson we will make a map of each officer-involved shooting that lets you click on the shooting and see some information about it. If we see a cluster of shootings, we can click on each shooting to see if they are similar. Though it is possible to find these patterns just looking at the data, it is easier to be able to see a geographic pattern and immediately look at information about each incident.

16.1.2 Police departments use them

Interactive maps are popular in large police departments such as Philadelphia and New York City. They allow easy understanding of geographic patterns in the data and, importantly, allow such access to people who do not have the technical skills necessary to create the maps. If nothing else, learning interactive maps will help you with a future job.

16.2 Making the interactive map

As usual, let’s take a look at the top 6 rows of the data.
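As a rough sketch of this step, the code below builds a tiny stand-in for the shooting data and prints its first rows with head(). The rows are made up for illustration, and the column names (shooting_number, dates, location, lon, lat) are assumptions based on how the columns are described in this chapter; your own file may name them differently.

```r
# Hypothetical stand-in for the officer shooting data: the rows are made up
# and the column names are assumptions, not the real file's layout.
shootings <- data.frame(
  shooting_number = c("17-01", "17-02", "17-03"),
  dates    = c("2017-01-05", "2017-02-11", "2017-03-20"),
  location = c("100 block Example St", "200 block Sample Ave",
               "300 block Placeholder Rd"),
  lon      = c(-75.165, -75.150, -75.180),
  lat      = c(39.952, 39.960, 39.945),
  stringsAsFactors = FALSE
)

# head() returns the first 6 rows (or every row if there are fewer than 6).
top_rows <- head(shootings)
top_rows
```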

This data is fairly sparse about information regarding the shooting. All it has is the date, shooting number, and address (which isn’t that useful as location is already covered by the map). The level of detail about the crime may be sparse, but we can still create a map where you can click an incident dot on the map and a popup will tell you when it happened.

We will use the package leaflet for our interactive map. leaflet produces maps similar to Google Maps with circles (or any icon we choose) for each value we add to the map. It allows you to zoom in, scroll around, and provides context to each incident that isn’t available on a static map.

To make a leaflet map we need to run the function leaflet() and add a tile to the map. A tile is simply the background of the map. This website provides a large number of potential tiles to use, though many are not relevant to our purposes of crime mapping.

We will use a standard tile from OpenStreetMap. This tile gives street names and highlights important features such as parks and large stores, which provides useful context for looking at the data. The attribution parameter isn’t strictly necessary but it is good form to say where your tile is from.
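A minimal sketch of a base map with an OpenStreetMap tile might look like the following. The tile URL used here is the standard OpenStreetMap template (it is also what addTiles() falls back on when no URL is given), and the attribution text is our own wording, so treat both as assumptions rather than the only valid choices.

```r
library(leaflet)

# Start an empty map and add the OpenStreetMap tile as the background.
m <- leaflet() %>%
  addTiles(
    urlTemplate = "https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png",
    attribution = '&copy; <a href="https://openstreetmap.org">OpenStreetMap</a> contributors'
  )

m  # display the map
```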

When you run the above code it shows a world map (copied several times). Zoom into it and it’ll start showing relevant features of wherever you’re looking.

Note the %>% between the leaflet() function and the addTiles() function. This is called a “pipe” in R and is used like the + in ggplot() to combine multiple functions together. This is used heavily in what is called the “tidyverse”, a series of packages that are prominent in modern R and useful for data analysis. We won’t be covering them in this book but for more information on them you can check the tidyverse website. For this lesson you need to know that each piece of the leaflet function must end with %>% for the next line to work.

To add the points to the graph we use the function addMarkers() which has two parameters, lng and lat. For both parameters we put the column in which the longitude and latitude are, respectively.
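Here is one way this could look. The shootings data frame is a made-up stand-in, and the column names lon and lat are assumptions; swap in whatever your longitude and latitude columns are called. For brevity this sketch calls addTiles() with no arguments, which uses the default OpenStreetMap tile.

```r
library(leaflet)

# Hypothetical stand-in data: made-up points with assumed column names.
shootings <- data.frame(
  dates = c("2017-01-05", "2017-02-11", "2017-03-20"),
  lon   = c(-75.165, -75.150, -75.180),
  lat   = c(39.952, 39.960, 39.945)
)

# lng takes the longitude column, lat takes the latitude column.
m <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng = shootings$lon, lat = shootings$lat)

m  # display the map
```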

It now adds an icon indicating where every shooting in our data is. You can zoom in and scroll around to see more about where the shootings happen. These icons are a bit large, covering nearly all of the city and making it hard to see where shootings happen. To change the icons to circles we can change the function addMarkers() to addCircleMarkers(), keeping the rest of the code the same.

This makes the icon into circles but they are still large and cover most of the map. To adjust the size of our icons we use the radius parameter in addMarkers() or addCircleMarkers(). The larger the radius, the larger the icons.
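The sketch below swaps addMarkers() for addCircleMarkers() and sets radius to 5. As before, the data frame is a made-up stand-in and the lon and lat column names are assumptions.

```r
library(leaflet)

# Hypothetical stand-in data: made-up points with assumed column names.
shootings <- data.frame(
  dates = c("2017-01-05", "2017-02-11", "2017-03-20"),
  lon   = c(-75.165, -75.150, -75.180),
  lat   = c(39.952, 39.960, 39.945)
)

# Same map as before, but drawn with circles; radius controls their size.
m <- leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = shootings$lon, lat = shootings$lat, radius = 5)

m  # display the map
```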

Setting the radius option to 5 shrinks the size of the icon a lot. In your own maps you’ll have to fiddle with this option to get it to look the way you want. Let’s move on to adding information about each icon when clicked upon.

16.3 Adding popup information

The parameter popup in the addMarkers() or addCircleMarkers() functions lets you input a character value (if not already a character value it will convert it to one) and that will be shown as a popup when you click on the icon. Let’s start simple here by inputting the dates column in our data and then build it up to a more complicated popup.
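A simple version of this, assuming the date column is called dates and using made-up stand-in data, could be:

```r
library(leaflet)

# Hypothetical stand-in data: made-up points with assumed column names.
shootings <- data.frame(
  dates = c("2017-01-05", "2017-02-11", "2017-03-20"),
  lon   = c(-75.165, -75.150, -75.180),
  lat   = c(39.952, 39.960, 39.945),
  stringsAsFactors = FALSE
)

# popup takes one value per point; here it is just the date of each incident.
m <- leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = shootings$lon, lat = shootings$lat,
                   radius = 5, popup = shootings$dates)

m  # display the map
```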

Try clicking around and you’ll see that the date of the incident you clicked on appears over the dot. Though fairly clear in this case, we usually want to have a title indicating what the value in the popup means. We can do this by using the paste() function to combine text explaining the value with the value itself. Let’s add the words “Date of Shooting:” before the date.
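As a sketch, paste() joins its inputs with a space by default, so we can build the labeled text first and hand it to popup. The data frame is a made-up stand-in and its column names are assumptions.

```r
library(leaflet)

# Hypothetical stand-in data: made-up points with assumed column names.
shootings <- data.frame(
  dates = c("2017-01-05", "2017-02-11", "2017-03-20"),
  lon   = c(-75.165, -75.150, -75.180),
  lat   = c(39.952, 39.960, 39.945),
  stringsAsFactors = FALSE
)

# paste() works on the whole column, giving one labeled string per incident.
popup_text <- paste("Date of Shooting:", shootings$dates)

m <- leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = shootings$lon, lat = shootings$lat,
                   radius = 5, popup = popup_text)

m  # display the map
```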

We don’t have many other columns but we can add the location and shooting number to the popup by adding them to the paste() function we’re using.

Just adding the location text makes it try to print out everything on one line which is hard to read. If we add the text <br> where we want a line break it will make one. <br> is the HTML tag for line-break which is why it works making a new line in this case.
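Putting the pieces together, a popup with all three values on separate lines might be built like this. The stand-in data is made up, and the column names shooting_number and location are assumptions about how those columns are named.

```r
library(leaflet)

# Hypothetical stand-in data: made-up rows with assumed column names.
shootings <- data.frame(
  shooting_number = c("17-01", "17-02", "17-03"),
  dates    = c("2017-01-05", "2017-02-11", "2017-03-20"),
  location = c("100 block Example St", "200 block Sample Ave",
               "300 block Placeholder Rd"),
  lon      = c(-75.165, -75.150, -75.180),
  lat      = c(39.952, 39.960, 39.945),
  stringsAsFactors = FALSE
)

# "<br>" is an HTML line break, so each value starts on its own line.
popup_text <- paste("Date of Shooting:", shootings$dates, "<br>",
                    "Shooting Number:", shootings$shooting_number, "<br>",
                    "Location:", shootings$location)

m <- leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = shootings$lon, lat = shootings$lat,
                   radius = 5, popup = popup_text)

m  # display the map
```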

16.4 Dealing with too many markers

Even though we shrank the circles, it is still rather hard to see any trends as there are so many incidents and the circles are still relatively large. One solution is to keep shrinking the circles, but this quickly becomes a bad solution when using larger data such as a full crime data set (Philadelphia data alone has about 200k crimes reported per year). The other solution is to cluster the data into groups where the individual dots only show once you zoom in.

If we add the code clusterOptions = markerClusterOptions() to our addCircleMarkers() it will cluster for us.
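A sketch of the clustered version, again with made-up stand-in data and assumed column names:

```r
library(leaflet)

# Hypothetical stand-in data: made-up points with assumed column names.
shootings <- data.frame(
  dates = c("2017-01-05", "2017-02-11", "2017-03-20", "2017-04-02"),
  lon   = c(-75.165, -75.1651, -75.180, -75.1802),
  lat   = c(39.952, 39.9521, 39.945, 39.9451),
  stringsAsFactors = FALSE
)

# clusterOptions groups nearby points until you zoom in far enough.
m <- leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = shootings$lon, lat = shootings$lat,
                   radius = 5,
                   popup = paste("Date of Shooting:", shootings$dates),
                   clusterOptions = markerClusterOptions())

m  # display the map
```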

Incidents close to each other are grouped together in fairly arbitrary groupings and we can see how large each grouping is by moving our cursor over the circle. Click on a circle or zoom in and it will show smaller groupings at lower levels of aggregation. Keep clicking or zooming in and it will eventually show each incident as its own circle.

This method is very useful for dealing with huge amounts of data as it avoids overflowing the map with too many icons at one time. A downside, however, is that the clusters are created arbitrarily meaning that important context, such as neighborhood, can be lost.

16.5 Interactive choropleth maps

In Chapter 15 we worked on choropleth maps which are maps with shaded regions, such as states colored by which political party won them in an election. Here we will make interactive choropleth maps where you can click on a shaded region and see information about that region. We’ll make the same map as before - Census tracts with the number of officer-involved shootings.

Let’s load the tract-level officer-involved shooting data we made earlier.

We’ll begin the leaflet map similar to before but use the function addPolygons() and our input here is the geometry column of philly_tracts_shootings.

It gives us a blank map because our polygons are projected to Philly’s projection while the leaflet map expects the standard CRS, WGS84 which uses longitude and latitude. So we need to change our projection to that using the st_transform() function from the sf package.

Now let’s try again.
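The sketch below shows the reprojection followed by the second attempt at the map. Since we can't load the real tract file here, it builds two made-up square "tracts" and first pushes them into a projected CRS to imitate the problem. EPSG:2272 (a Pennsylvania state plane projection) is our assumption for illustration, as is using the EPSG code 4326 to ask for WGS84.

```r
library(sf)
library(leaflet)

# Helper (hypothetical) that makes a small square polygon from a corner point.
square <- function(lon, lat, size = 0.01) {
  st_polygon(list(rbind(
    c(lon, lat), c(lon + size, lat), c(lon + size, lat + size),
    c(lon, lat + size), c(lon, lat)
  )))
}

# Made-up stand-in for philly_tracts_shootings, stored in a projected CRS
# (EPSG:2272 is an assumption) to mimic data that leaflet cannot draw as is.
philly_tracts_shootings <- st_sf(
  GEOID10 = c("tract_A", "tract_B"),
  number_shootings = c(2, 7),
  geometry = st_sfc(square(-75.17, 39.95), square(-75.16, 39.95), crs = 4326)
)
philly_tracts_shootings <- st_transform(philly_tracts_shootings, crs = 2272)
was_longlat <- st_is_longlat(philly_tracts_shootings)

# Reproject to WGS84 (longitude and latitude), which leaflet expects.
philly_tracts_shootings <- st_transform(philly_tracts_shootings, crs = 4326)

m <- leaflet() %>%
  addTiles() %>%
  addPolygons(data = philly_tracts_shootings$geometry)

m  # display the map
```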

It made a map with large blue lines indicating each tract. Let’s change the appearance of the graph a bit before making a popup or shading the tracts. The parameter color in addPolygons() changes the color of the lines - let’s change it to black. The lines are also very large, blurring into each other and making the tracts hard to see. We can change the weight parameter to alter the size of these lines - smaller values are smaller lines. Let’s try setting this to 1.
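With the two parameters set, the call might look like this. The tracts here are made-up squares standing in for the real data, already in WGS84.

```r
library(sf)
library(leaflet)

# Helper (hypothetical) that makes a small square polygon from a corner point.
square <- function(lon, lat, size = 0.01) {
  st_polygon(list(rbind(
    c(lon, lat), c(lon + size, lat), c(lon + size, lat + size),
    c(lon, lat + size), c(lon, lat)
  )))
}

# Made-up stand-in for philly_tracts_shootings.
philly_tracts_shootings <- st_sf(
  GEOID10 = c("tract_A", "tract_B"),
  number_shootings = c(2, 7),
  geometry = st_sfc(square(-75.17, 39.95), square(-75.16, 39.95), crs = 4326)
)

# color sets the line color; weight sets the line thickness.
m <- leaflet() %>%
  addTiles() %>%
  addPolygons(data = philly_tracts_shootings$geometry,
              color = "black", weight = 1)

m  # display the map
```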

That looks better and we can clearly distinguish each tract now.

As we did earlier, we can add the popup text directly to the function which makes the geographic shapes, in this case addPolygons(). Let’s add the GEOID10 column value - the unique ID code for that tract - and the number of shootings that occurred in that tract. As before when we click on a tract a popup appears with the output we specified.
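One way to write that popup, using made-up tracts and label wording of our own choosing:

```r
library(sf)
library(leaflet)

# Helper (hypothetical) that makes a small square polygon from a corner point.
square <- function(lon, lat, size = 0.01) {
  st_polygon(list(rbind(
    c(lon, lat), c(lon + size, lat), c(lon + size, lat + size),
    c(lon, lat + size), c(lon, lat)
  )))
}

# Made-up stand-in for philly_tracts_shootings.
philly_tracts_shootings <- st_sf(
  GEOID10 = c("tract_A", "tract_B"),
  number_shootings = c(2, 7),
  geometry = st_sfc(square(-75.17, 39.95), square(-75.16, 39.95), crs = 4326)
)

# One popup string per tract: its ID and its number of shootings.
popup_text <- paste0("Tract ID: ", philly_tracts_shootings$GEOID10, "<br>",
                     "Number of Shootings: ",
                     philly_tracts_shootings$number_shootings)

m <- leaflet() %>%
  addTiles() %>%
  addPolygons(data = philly_tracts_shootings$geometry,
              color = "black", weight = 1, popup = popup_text)

m  # display the map
```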

For these types of maps we generally want to shade each polygon to indicate how frequently the event occurs in the polygon. For this process we will make a simple function which will automatically shade the tracts by the value in the column we want it shaded by - number_shootings.

We’ll use the function colorNumeric() to make our colors, which takes a lot of the work out of this process. This function takes two inputs, first a color palette which we can get from the site colorbrewer2. Let’s use the fourth bar in the Sequential page, which is light orange to red. If you look in the section with each HEX value it says that the palette is “3-class OrRd”. The “3-class” just means we selected 3 colors, the “OrRd” is the part we want. That will tell colorNumeric() to make the palette using these colors. The second parameter is the column for our numeric variable, number_shootings.

We will save the output of colorNumeric("OrRd", philly_tracts_shootings$number_shootings) as a new variable which we’ll call pal for convenience. Then inside of addPolygons() we’ll set the parameter fillColor to pal(philly_tracts_shootings$number_shootings), running this function on the column. What this really does is determine which color every tract should be based on the value in the number_shootings column.
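In code, those two steps could look like the sketch below, again on made-up tracts. Note that colorNumeric() returns a function, which is why pal is then called on the column.

```r
library(sf)
library(leaflet)

# Helper (hypothetical) that makes a small square polygon from a corner point.
square <- function(lon, lat, size = 0.01) {
  st_polygon(list(rbind(
    c(lon, lat), c(lon + size, lat), c(lon + size, lat + size),
    c(lon, lat + size), c(lon, lat)
  )))
}

# Made-up stand-in for philly_tracts_shootings.
philly_tracts_shootings <- st_sf(
  GEOID10 = c("tract_A", "tract_B"),
  number_shootings = c(2, 7),
  geometry = st_sfc(square(-75.17, 39.95), square(-75.16, 39.95), crs = 4326)
)

# pal is a function that maps a number of shootings to a color.
pal <- colorNumeric("OrRd", philly_tracts_shootings$number_shootings)
tract_colors <- pal(philly_tracts_shootings$number_shootings)

m <- leaflet() %>%
  addTiles() %>%
  addPolygons(data = philly_tracts_shootings$geometry,
              color = "black", weight = 1,
              fillColor = tract_colors)

m  # display the map
```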

Since the tracts are transparent, it is hard to distinguish which color is shown. We can make each tract a solid color by setting the parameter fillOpacity inside of addPolygons() to 1.

To add a legend to this we use the function addLegend() which takes three parameters. pal asks which color palette we are using - we want it to be the exact same as we use to color the tracts so we’ll use the pal object we made. The values parameter is used for which column our numeric values are from, in our case the number_shootings column so we’ll input that. Finally opacity determines how transparent the legend will be. As each tract is set to not be transparent at all, we’ll also set this to 1.

Finally, we can add a title to the legend using the title parameter inside of addLegend().
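A sketch of the finished map, with solid fills, a legend, and a legend title, is below. The tracts are made up and the title text is our own wording.

```r
library(sf)
library(leaflet)

# Helper (hypothetical) that makes a small square polygon from a corner point.
square <- function(lon, lat, size = 0.01) {
  st_polygon(list(rbind(
    c(lon, lat), c(lon + size, lat), c(lon + size, lat + size),
    c(lon, lat + size), c(lon, lat)
  )))
}

# Made-up stand-in for philly_tracts_shootings.
philly_tracts_shootings <- st_sf(
  GEOID10 = c("tract_A", "tract_B"),
  number_shootings = c(2, 7),
  geometry = st_sfc(square(-75.17, 39.95), square(-75.16, 39.95), crs = 4326)
)

pal <- colorNumeric("OrRd", philly_tracts_shootings$number_shootings)

m <- leaflet() %>%
  addTiles() %>%
  addPolygons(data = philly_tracts_shootings$geometry,
              color = "black", weight = 1,
              fillColor = pal(philly_tracts_shootings$number_shootings),
              fillOpacity = 1) %>%   # 1 = fully solid fill
  addLegend(pal = pal,               # same palette as the tracts
            values = philly_tracts_shootings$number_shootings,
            opacity = 1,
            title = "Number of Shootings")

m  # display the map
```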

17 More graphing with ggplot

Placeholder

17.1 Graphing a single variable

17.1.1 Numeric variable

17.1.2 Categorical variable

17.2 Time Series

18 R Markdown

Placeholder

18.1 Code

18.1.1 Hiding code in the output

18.2 Tables

18.3 Making the output file

(PART) Data

19 Introduction

At this point you have learned how to read in data, manipulate it to get just the parts you want or to aggregate it to the level you want, and visualize it through maps or graphs. You’ve done so using data sets that are commonly used in criminological research.

In the next several chapters we will be introducing a number of other data sets - or looking deeper into data we’ve already seen - that are common in criminology. While these chapters do use R a bit to explore or read in the data, they are primarily a discussion of the trade-offs of using each data set. Some of the data sets are difficult to read into R, requiring more steps than you may be used to, so these chapters will also discuss how to get that data into R.

20 Uniform Crime Report (UCR) Data - Offenses Known and Clearances by Arrest

Placeholder

20.1 Exploring the UCR data

20.2 ORIs - Unique agency identifiers

20.3 Hierarchy Rule

20.4 Which crimes are included

20.4.1 Index Crimes

20.4.2 The problem with using index crimes

20.4.3 Rape definition change

20.5 Actual offenses, clearances, and unfounded offenses

20.5.1 Actual

20.5.2 Total Cleared

20.5.3 Cleared Where All Offenders Are Under 18

20.5.4 Unfounded

20.6 Number of months reported

21 Census data from Social Explorer

Placeholder

21.1 Getting Census data from Social Explorer

22 Census data from IPUMS

Placeholder

22.1 Getting IPUMS data

22.2 Cleaning the data

22.3 Aggregating the data

22.4 Graphing the data

22.5 Mapping the data

23 National Incident-Based Reporting System (NIBRS) Data

Placeholder

23.1 Downloading the data

23.2 Reading the data

(APPENDIX) Appendix

24 Useful resources

24.0.1 Learning R and coding issues

R for Data Science - This free online book provides a good introduction for R though it differs in several important ways from this class.

Stack Overflow - Stack Overflow is a website that answers programming-related questions. It’s like the Yahoo Answers of programming. That said, a lot of the answers are bad. Some answers are overly confusing or provide code that you may not understand. You can use this source, but don’t rely too heavily on it. Its search function isn’t great so it’s better to Google your question and choose the stackoverflow.com result.